Abstract

We propose a new yet natural algorithm for learning the graph structure of general discrete 
graphical models (a.k.a. Markov random fields) from samples. Our algorithm finds the 
neighborhood of a node by sequentially adding nodes that produce the largest reduction in 
empirical conditional entropy; it is greedy in the sense that the choice of addition is based 
only on the reduction achieved at that iteration. Its sequential nature gives it a lower 
computational complexity as compared to other existing comparison-based techniques, all 
of which involve exhaustive searches over every node set of a certain size. Our main result 
characterizes the sample complexity of this procedure, as a function of node degrees, graph 
size and girth in factor-graph representation. We subsequently specialize this result to the 
case of Ising models, where we provide a simple transparent characterization of sample 
complexity as a function of model and graph parameters. 

For tree graphs, our algorithm is the same as the classical Chow-Liu algorithm, and in 
that sense can be considered the extension of the same to graphs with cycles. 



1. Introduction 

Markov Random Fields (MRF), or undirected graphical models, encode conditional in- 
dependence relations between random variables. Depending on the application at hand, 
nodes of a graphical model may represent people, genes, languages, processes, etc., while 
the graphical model illustrates certain conditional dependencies among them (for example, 
influence in a social network, physiological functionality in genetic networks, etc.). Often 
the knowledge of the underlying graph is not available beforehand, but must be inferred 
from certain observations of the system. In mathematical terms, these observations corre- 
spond to samples drawn from the underlying distribution. Thus, the core task of structure 
learning is that of inferring conditional dependencies between random variables from i.i.d 
samples drawn from their joint distribution. The importance of the MRF in understanding 
the underlying system makes structure learning an important primitive for studying such 
systems. 

This paper proposes a new yet natural method to infer the graph structure of an MRF
from samples, and analytically characterizes its sample complexity in terms of graph and
model parameters. Our algorithm is based on the fact that the graph neighborhood of a
node is also its Markov blanket, and conditioned on it the node's variable is independent
of all others. We build this neighborhood in a greedy fashion, by sequentially adding the
nodes that give the biggest reductions in conditional entropy. Our analytical results, both
for general models and for Ising models, require lower bounds on the girth of the graph. In
practice, on both synthetic examples and a real dataset drawn from senate voting records,
our algorithm is seen to perform quite well even for graphs with many small cycles.

The results in this paper were presented in (Netrapalli et al., 2010) without proofs of the
theorems. This paper includes all the proofs along with simulations.

Our algorithm has lower computational complexity than other algorithms
that are not tailored to specific model classes (note that if we know a priori that we are
looking for an Ising model, or a Gaussian one, faster methods exist). We review and compare
our algorithm to existing literature below. We also elaborate on the sense in which our
algorithm can be thought of as an extension of the Chow-Liu algorithm (Chow and Liu,
1968) to graphs with cycles.



The remaining sections are organized as follows. In Section 2 we review graphical models
and some results from information theory, and set up the structure learning problem. Our
new structure learning algorithm, GreedyAlgorithm($\epsilon$), is given in Section 3. Next, in
Section 4 we develop a sufficient condition for the correctness of the algorithm for general
graphs. To demonstrate the applicability of this condition, we translate it into equivalent
conditions for learning an Ising model in Section 5. We present simulation results evaluating
our algorithm in Section 6. We discuss future work and conclude in Section 7. The proofs
of theorems are in the Appendix.



1.1 Related Work 



Learning the structure of graphical models is a well-established problem; existing work
falls into two broad categories. The first category involves methods tailored for a specific
parametric form of the probability distribution. In particular, when a parametric family is
known, the (log) likelihood of the data is written as a function (often convex) of the
parameters of the distribution; this likelihood is then maximized, often with added regularizers
like an $\ell_1$ penalty, to find the parameters and hence the graph structure. Examples in this
category include (Ravikumar et al., 2011; El Karoui, 2008; Furrer and Bengtsson, 2007;
Zhou et al., 2010; Anandkumar and Tan, 2011a) for Gaussian graphical models,
(Ravikumar et al., 2010; Banerjee et al., 2008; Santhanam and Wainwright, 2009) for Ising models,
and (Jalali et al., 2011) for general discrete pairwise graphical models.

The other category of graphical model structure learning algorithms are those that do
not need to assume (and cannot leverage) specific parametric forms of the distribution.
Rather, they are based on the notion that a node's Markov blanket, i.e. its neighborhood
in the graphical model, makes a node conditionally independent of other nodes. Examples
of such algorithms include (Chow and Liu, 1968; Abbeel et al., 2006; Bresler et al., 2008;
Anandkumar and Tan, 2011b; Bento and Montanari, 2009). All of these methods involve
an exhaustive search over all subsets of nodes up to a certain size $d$, typically the degree of
the node whose neighborhood we are trying to find. This results in a high computational
complexity for the algorithms.

This paper falls into the latter category, but avoids the high computational complexity 
of exhaustive searching by building the sets in a greedy fashion instead. 
Finally, we note that if the graph is a tree, our method is equivalent to the classical
Chow-Liu method (Chow and Liu, 1968). In particular, (Chow and Liu, 1968) involves
making a max-weight spanning tree where the edge weights are mutual information. From
any given fixed node's perspective, this algorithm adds edges in the same order as our
algorithm, i.e. greedily adding nodes that give the biggest reduction in conditional entropy.
In that sense, our algorithm can be considered a generalization of (Chow and Liu, 1968) to
graphs with cycles.



2. Preliminaries 

We now set up the (standard) graphical model structure learning problem. Let $X$ be a
$p$-dimensional random vector $\{X_1, X_2, \ldots, X_p\}$, where each component $X_i$ of $X$ takes values in
a finite set $\mathcal{X}$. We use the shorthand notation $P(x_i) = P(X_i = x_i)$, $x_i \in \mathcal{X}$, and similarly for
a set $A \subseteq \{1, 2, \ldots, p\}$, we define $P(x_A) = P(X_A = x_A)$, $x_A \in \mathcal{X}^{|A|}$, where $X_A = \{X_i \mid i \in A\}$.

Let $G$ be the Markov graph of $X$, with vertex set $V$ (one node $i \in V$ for each variable
$X_i$), and edge set $E$. In particular, this means that the probability distribution of $X$
satisfies the local Markov property (Lauritzen, 1996) with respect to $G$: for every $i \in V$,
if its neighborhood in $G$ is $N(i)$, then for any set $B \subseteq V \setminus (\{i\} \cup N(i))$, we have that
$P(x_i \mid x_{N(i)}, x_B) = P(x_i \mid x_{N(i)})$ for all $(x_i, x_{N(i)}, x_B)$.

Our goal is to learn the structure of $G$, i.e. the set of its edges $E$, from $n$ vector samples
$x^{(1)}, \ldots, x^{(n)}$, which are drawn i.i.d. from the joint distribution. The empirical distribution $\hat{P}$
is defined as follows: for any set $A$ of variables (nodes) and corresponding values $x_A$,

$$\hat{P}(x_A) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_A^{(i)} = x_A\}.$$

The empirical entropy and conditional entropy refer to the corresponding quantities for this
empirical distribution $\hat{P}$. In this paper we will refer to the true entropies by $H$ and the
empirical ones by $\hat{H}$. The following fact is immediate from conditional independence and
the Data Processing Inequality, see (Cover and Thomas, 2006).

Proposition 1 For any node $i \in V$, its neighborhood $N(i)$ in $G$, and any set $A \subseteq V \setminus \{i\}$,
we have that

$$H(X_i \mid X_{N(i)}) \le H(X_i \mid X_A).$$



Motivated by this relationship, (Abbeel et al., 2006) advocated finding $N(i)$ by exhaustively
searching over all sets of size less than $d$, where $d$ is the (upper bound on the) degree of node
$i$. Our method avoids this exhaustive search, and instead builds the neighborhood in a sequential
greedy fashion.
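The empirical conditional entropy that drives both approaches can be computed directly from counts. The following is a minimal sketch, assuming samples are stored as a list of equal-length tuples and entropies are measured in nats; the function name `empirical_conditional_entropy` and the toy data are illustrative choices, not taken from the paper.

```python
from collections import Counter
from math import log

def empirical_conditional_entropy(samples, i, cond):
    """Plug-in estimate of H(X_i | X_cond) in nats from a list of sample tuples."""
    n = len(samples)
    cond = tuple(cond)
    # Counts of (x_i, x_cond) and of x_cond alone under the empirical distribution.
    joint = Counter((s[i],) + tuple(s[k] for k in cond) for s in samples)
    marg = Counter(tuple(s[k] for k in cond) for s in samples)
    # H(X_i | X_A) = -sum P(x_i, x_A) * log( P(x_i, x_A) / P(x_A) )
    return -sum((c / n) * log(c / marg[key[1:]]) for key, c in joint.items())

# Toy data (hypothetical): X_1 is a copy of X_0, X_2 is independent of both.
samples = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
h_empty = empirical_conditional_entropy(samples, 0, [])
h_copy = empirical_conditional_entropy(samples, 0, [1])
h_indep = empirical_conditional_entropy(samples, 0, [2])
```

In this toy data, conditioning on the copy removes all uncertainty about X_0, while conditioning on the independent variable removes none, which is the kind of contrast the greedy selection rule exploits.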

Of course, any algorithm would need to work with samples, which in our case would
be empirical entropies. We find the following result, obtained by combining Theorem
16.3.2 and Lemma 16.3.1 from (Cover and Thomas, 2006), useful in translating between
conditions on the true and empirical entropy quantities.



Proposition 2 Let $P$ and $Q$ be two probability mass functions on a finite set $\mathcal{X}$, with
entropies $H(P)$ and $H(Q)$ respectively, and with total variation distance $\|P - Q\|_1$ given
by:

$$\|P - Q\|_1 = \sum_{x \in \mathcal{X}} |P(x) - Q(x)|.$$

Then

$$|H(P) - H(Q)| \le -\|P - Q\|_1 \log \frac{\|P - Q\|_1}{|\mathcal{X}|}. \qquad (1)$$

Further, if the relative entropy between them is given by $D(P\|Q)$, then

$$D(P\|Q) \ge \frac{1}{2 \log 2} \|P - Q\|_1^2. \qquad (2)$$
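As a sanity check on the two bounds of Proposition 2, the sketch below tests them numerically on random distributions. It makes two assumptions that are not spelled out in the text: all quantities use natural logarithms (so the Pinsker-type constant becomes 1/2), and the pairs are kept within L1 distance 1/2, the regime in which Cover and Thomas state the entropy-continuity bound. All function names are illustrative.

```python
from math import log
import random

def entropy(p):
    return -sum(x * log(x) for x in p if x > 0)

def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q):
    return sum(a * log(a / b) for a, b in zip(p, q) if a > 0)

def random_pmf(k, rng):
    w = [rng.random() + 0.1 for _ in range(k)]  # +0.1 keeps every mass positive
    total = sum(w)
    return [x / total for x in w]

def count_violations(trials=1000, k=4, seed=0):
    """Count how often either bound fails on random pairs (P, Q)."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        p = random_pmf(k, rng)
        r = random_pmf(k, rng)
        # Mixing keeps ||P - Q||_1 = 0.2 * ||P - R||_1 <= 0.4.
        q = [0.8 * a + 0.2 * b for a, b in zip(p, r)]
        t = l1_distance(p, q)
        if t == 0:
            continue
        if abs(entropy(p) - entropy(q)) > -t * log(t / k) + 1e-12:
            violations += 1  # entropy-continuity bound (1)
        if kl_divergence(p, q) < 0.5 * t * t - 1e-12:
            violations += 1  # Pinsker-type bound (2), natural-log form
    return violations

violations = count_violations()
```

A count of zero is what the proposition predicts; the check is only a numerical illustration and not a substitute for the proofs in Cover and Thomas.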

We characterize the sample complexity of our algorithm for a class of graphs and mod- 
els. We specify these models in terms of their factor graph, which we define below for 
completeness. 

Definition 1 (Factor Graph) Given a graphical model $G(V, E)$, its factor graph is a
bipartite graph $G_f$ with vertex set $V \cup C$ where each vertex $c \in C$ corresponds to a maximal
clique in $G$. For any $v \in V$ and $c \in C$, there is an edge $\{v, c\}$ in $G_f$ if and only if $v \in c$ in
$G$.

We have the following simple lemma relating the distance between two nodes $i, j \in V$ in
the graphs $G$ and $G_f$.

Lemma 1 Given a graph $G$, let $G_f$ be its factor graph. Then for every $i, j \in V$ we have
$d_f(i, j) = 2 d(i, j)$, where $d$ and $d_f$ are the distances between $i$ and $j$ in $G$ and $G_f$ respectively.
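The factor-graph construction and the distance relation of Lemma 1 can be checked on a small example. The sketch below is illustrative only: it enumerates maximal cliques with a plain Bron-Kerbosch recursion (a standard technique swapped in here, not something the paper prescribes), builds the bipartite graph, and compares breadth-first distances. The example graph and all names are assumptions.

```python
from collections import deque
from itertools import combinations

def maximal_cliques(adj):
    """Bron-Kerbosch enumeration; adj maps each node to its set of neighbors."""
    cliques = []
    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    expand(set(), set(adj), set())
    return cliques

def factor_graph(adj):
    """Bipartite adjacency with variable nodes ('v', i) and factor nodes ('c', clique)."""
    fg = {("v", i): set() for i in adj}
    for c in maximal_cliques(adj):
        fg[("c", c)] = {("v", i) for i in c}
        for i in c:
            fg[("v", i)].add(("c", c))
    return fg

def bfs_distances(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

# Hypothetical example: a triangle {0, 1, 2} with a path 2-3-4 attached.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
fg = factor_graph(adj)
lemma_holds = all(
    bfs_distances(fg, ("v", i))[("v", j)] == 2 * bfs_distances(adj, i)[j]
    for i, j in combinations(adj, 2)
)
```

Each hop in G passes through exactly one factor node in the factor graph, which is why every distance doubles.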



3. The GreedyAlgorithm($\epsilon$) Structure Learning Algorithm

We now present our method, GreedyAlgorithm($\epsilon$), which proceeds by finding the Markov
neighborhood of each node separately. For the neighborhood of node $i$, it starts with an
empty set and iteratively adds nodes that bring the largest additional decrease in (empirical)
conditional entropy. It stops when this decrease is less than $\epsilon/2$. The formal specification
is presented in Algorithm 1.
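A minimal Python sketch of the greedy procedure described above follows. It is an illustration under stated assumptions rather than the authors' implementation: samples are a list of equal-length tuples, entropies are plug-in estimates in nats, node $i$ itself is excluded from the candidate set, and the toy data set is constructed so that its empirical distribution is exactly a three-node binary Markov chain with flip probability 1/4.

```python
from collections import Counter
from itertools import product
from math import log

def cond_entropy(samples, i, cond):
    """Empirical H(X_i | X_cond) in nats, computed from counts."""
    n = len(samples)
    joint = Counter((s[i],) + tuple(s[k] for k in cond) for s in samples)
    marg = Counter(tuple(s[k] for k in cond) for s in samples)
    return -sum((c / n) * log(c / marg[key[1:]]) for key, c in joint.items())

def greedy_neighborhood(samples, i, eps):
    """Grow the neighborhood of node i until the best entropy gain is below eps/2."""
    p = len(samples[0])
    nbrs = []
    current = cond_entropy(samples, i, nbrs)
    while True:
        # Assumption: node i itself is never a candidate.
        scores = {k: cond_entropy(samples, i, nbrs + [k])
                  for k in range(p) if k != i and k not in nbrs}
        if not scores:
            break
        j = min(scores, key=scores.get)
        if scores[j] < current - eps / 2:
            nbrs.append(j)
            current = scores[j]
        else:
            break
    return set(nbrs)

def greedy_structure(samples, eps):
    return {i: greedy_neighborhood(samples, i, eps) for i in range(len(samples[0]))}

# Toy data: counts proportional to P(x0) P(x1|x0) P(x2|x1) for the chain 0 - 1 - 2.
samples = []
for x0, x1, x2 in product([0, 1], repeat=3):
    count = (3 if x1 == x0 else 1) * (3 if x2 == x1 else 1)
    samples += [(x0, x1, x2)] * count

structure = greedy_structure(samples, eps=0.1)
```

On this toy chain the end nodes each recover the middle node only, because once the middle node is in the conditioning set the remaining node yields no further entropy reduction.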



4. Sufficient Conditions for General Discrete Graphical Models 

In this section, we provide guarantees for general discrete graphical models, under which
GreedyAlgorithm($\epsilon$) recovers the graphical model structure exactly. First, using an
example, we build up intuition for the sufficient conditions, and define two key notions:
non-degeneracy conditions and correlation decay. Our main result is presented in Section 4.2,
wherein we give a sufficient condition for the correctness of the algorithm in general discrete
graphical models.



4.1 Non-Degeneracy and Correlation Decay 

Algorithm 1 GreedyAlgorithm($\epsilon$)

for $i \in V$ do
    complete $\leftarrow$ FALSE
    $\hat{N}(i) \leftarrow \emptyset$
    while not complete do
        $j = \arg\min_{k \in V \setminus \hat{N}(i)} \hat{H}(X_i \mid X_{\hat{N}(i)}, X_k)$
        if $\hat{H}(X_i \mid X_{\hat{N}(i)}, X_j) < \hat{H}(X_i \mid X_{\hat{N}(i)}) - \frac{\epsilon}{2}$ then
            $\hat{N}(i) \leftarrow \hat{N}(i) \cup \{j\}$
        else
            complete $\leftarrow$ TRUE
        end if
    end while
end for

Before analyzing the correctness of structure learning from samples, a simpler problem worth
considering is one of algorithm consistency, i.e., does the algorithm succeed in identifying the
true graph given the true conditional distributions (or in other words, given an infinite
number of samples)? It turns out that the algorithm as presented in Algorithm 1 does not
even possess this property, as is illustrated by the following counter-example.

Let $V = \{0, 1, \ldots, D, D+1\}$, $X_i \in \{-1, 1\}$ $\forall i \in V$ and $E = \{\{0, i\}, \{i, D+1\} \mid 1 \le i \le D\}$.
Let $P(x_V) = \frac{1}{Z} \prod_{\{i,j\} \in E} e^{\theta x_i x_j}$, where $Z$ is a normalizing constant (this is the classical
zero-field Ising model potential). The graph is shown in Fig. 1.

Figure 1: An example of adding spurious nodes: execution of GreedyAlgorithm($\epsilon$) for node 0
adds node $D + 1$ in the first iteration, even though it is not a neighbor.

Suppose the actual entropies are given as input to Algorithm 1. It can be shown in
this case that for a given $\theta$, there exists a $D_{\mathrm{thresh}}$ such that if $D > D_{\mathrm{thresh}}$, then
Algorithm 1 will select the edge $\{0, D+1\}$ in the first iteration. This is easily
understood because if $D$ is large, the distribution of node 0 is best accounted for by node
$D + 1$, although it is not a neighbor. Thus, even with exact entropies, the algorithm will
always include edge $(0, D + 1)$, although it does not exist in the graph.
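The counter-example can be reproduced by brute force for small $D$. The sketch below enumerates all configurations of the model, computes the exact conditional entropies of node 0 given each single other node, and reports which node the first greedy step would pick. The parameter values ($\theta = 0.5$, $D = 2$ and $D = 5$) and the function name are illustrative assumptions, not values from the paper.

```python
from itertools import product
from math import exp, log

def first_greedy_pick(D, theta):
    """First node chosen for node 0 by the greedy rule, using exact entropies."""
    nodes = D + 2
    edges = [(0, i) for i in range(1, D + 1)] + [(i, D + 1) for i in range(1, D + 1)]
    # Unnormalized weight of every configuration in {-1, 1}^(D+2).
    weights = {
        x: exp(theta * sum(x[a] * x[b] for a, b in edges))
        for x in product([-1, 1], repeat=nodes)
    }
    Z = sum(weights.values())

    def entropy_of_0_given(k):
        joint = {}
        for x, w in weights.items():
            key = (x[0], x[k])
            joint[key] = joint.get(key, 0.0) + w / Z
        marg = {}
        for (_, xk), pr in joint.items():
            marg[xk] = marg.get(xk, 0.0) + pr
        return -sum(pr * log(pr / marg[xk]) for (_, xk), pr in joint.items())

    scores = {k: entropy_of_0_given(k) for k in range(1, nodes)}
    return min(scores, key=scores.get), scores

pick_small, _ = first_greedy_pick(D=2, theta=0.5)
pick_large, scores_large = first_greedy_pick(D=5, theta=0.5)
```

With these illustrative values, the first pick for $D = 2$ is a true neighbor of node 0, whereas for $D = 5$ it is the non-neighbor $D + 1 = 6$, matching the threshold behavior described above.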

The algorithm can however easily be shown to satisfy the following weaker consistency
guarantee: given infinite samples, for any node $i$ in the graph, the algorithm will return a
super-neighborhood, i.e., a superset of the neighborhood of $i$. This suggests a simple fix to
obtain a consistent algorithm, as we can follow the greedy phase by a 'node-pruning' phase,
wherein we test each node in the neighborhood of a node $i$ returned by the algorithm (to do
this, we can compare the entropy of $i$ conditioned on the neighborhood with and without
a node, and remove it if they are the same). However, the problem is complicated by the
presence of samples, as pruning a large super-neighborhood requires calculating estimates
of entropy conditioned on a large number of nodes, and hence this drives up the sample
complexity. In the rest of the paper, we avoid this problem by ignoring the pruning step,
and instead prove a stronger correctness guarantee: given any node $i$, the algorithm always
picks a correct neighbor of $i$ as long as any one remains undiscovered. Towards this end,
we first define two conditions which we require for the correctness of GreedyAlgorithm($\epsilon$).

Assumption 1 (Non-degeneracy) Choose a node $i$. Let $N(i)$ be the set of its neighbors.
Then $\exists \epsilon > 0$ such that $\forall A \subset N(i)$, $\forall j \in N(i) \setminus A$ and $\forall l \in N(j) \setminus \{i\}$, we have that

$$H(X_i \mid X_A) - H(X_i \mid X_A, X_j) > \epsilon \quad \text{and} \qquad (3)$$

$$H(X_i \mid X_A, X_l) - H(X_i \mid X_A, X_j, X_l) > \epsilon. \qquad (4)$$

Assumption 1 is illustrated in Fig. 2.

Figure 2: Non-degeneracy condition for node $i$: (i) the entropy of $i$ conditioned on any
sub-neighborhood $A$ reduces by at least $\epsilon$ if any other neighbor $j$ is added to the
conditioning set; (ii) the entropy of $i$ conditioned on $A$ and a two-hop neighbor $l$
reduces by at least $\epsilon$ if the corresponding one-hop neighbor $j$ is added to the
conditioning set.



Assumption 2 (Correlation Decay) Choose a node $i$. Let $N^1(i)$ and $N^2(i)$ be the sets
of its 1-hop and 2-hop neighbors respectively. Choose another set of nodes $B$. Let
$d(i, B) = \min_{j \in B} d(i, j)$, where $d(i, j)$ denotes the distance between nodes $i$ and $j$. Then, we have that
$\forall x_i, x_{N^1(i)}, x_{N^2(i)}, x_B$,

$$\left| P(x_i, x_{N^1(i)}, x_{N^2(i)} \mid x_B) - P(x_i, x_{N^1(i)}, x_{N^2(i)}) \right| < f(d(i, B)),$$

where $f$ is a monotonic decreasing function.



Assumption 1 (or a variant thereof) is a standard assumption for showing correctness
of any structure learning algorithm, as it ensures that there is a unique minimal graphical
model for the distribution from which the samples are generated. Although the way we
state the assumption is tailored to our algorithm, it can be shown to be equivalent to
similar assumptions in the literature (Bresler et al., 2008). Informally speaking, Assumption 1
states that for node $i$, any 2-hop neighbor captures less information about node $i$ than the
corresponding 1-hop neighbor. In the case of a Markov chain, Assumption 1 reduces to a
weaker version of an $\epsilon$-Data Processing Inequality (i.e., DPI with an $\epsilon$ gap), and in
a sense, Assumption 1 can be viewed as a generalized $\epsilon$-DPI for networks with cycles.

On the other hand, Assumption 2 along with large girth implies that the information a
node $j$ has about node $i$ is 'almost Markov' along the shortest path between $i$ and $j$. This
in conjunction with Assumption 1 implies that for any two nodes $i$ and $k$, the information
about $i$ captured by $k$ is less than that captured by $j$, where $j$ is the neighbor of $i$ on the
shortest path between $i$ and $k$. It is also known (Bento and Montanari, 2009) that structure
learning is a much harder problem when there is no correlation decay.



4.2 Guarantees for the Recovery of a General Graphical Model 

We now state our main theorem, wherein we give a sufficient condition for correctness of
GreedyAlgorithm($\epsilon$) in a general graphical model.

The counter-example given in Section 4.1 suggests that the addition of spurious nodes
to the neighborhood of $i$ is related to the existence of non-neighboring nodes of $i$ which
somehow accumulate sufficient influence over it. The accumulation of influence is due to
slow decay of influence on short paths (corresponding to a high $\theta$ in the example), and
the effect of a large number of such paths (corresponding to high $D$). Correlation decay
(Assumption 2) allows us to control the first. Intuitively, the second can be controlled if
the neighborhood of $i$ is 'locally tree-like'. To quantify this notion, we define the girth of
a graph, $\mathrm{Girth}(G)$, to be the length of the smallest cycle in the graph $G$. Now we have the
following theorem.

Theorem 2 Consider a graphical model $G$ where the random variable corresponding to
each node takes values in a set $\mathcal{X}$ and satisfies the following:

• Non-degeneracy (Assumption 1) with parameter $\epsilon$,

• Correlation decay (Assumption 2) with decay function $f(\cdot)$,

• Maximum degree $D$.

Define h = h(e,D) = e ^ 6 ^ — and suppose exists. Further suppose Gf (the 

factor graph of $G$) obeys the following condition:

$$\mathrm{Girth}(G_f) \triangleq g_f > 4\left(f^{-1}(h) + 1\right). \qquad (5)$$

Then, given $\delta > 0$, GreedyAlgorithm($\epsilon$) recovers $G$ exactly with probability greater than $1 - \delta$
with sample complexity $n = \xi \epsilon^{-4} \log\frac{p}{\delta}$, where $\xi$ is a constant independent of $p$, $\epsilon$ and $\delta$.
The proof follows from the following two lemmas. Lemma 3 implies that if we had access to
actual entropies, Algorithm 1 always recovers the neighborhood of a node exactly. Lemma
4 shows that with the number of samples $n$ as stated in Theorem 2, the empirical entropies
are very close to the actual entropies with high probability and hence Algorithm 1 recovers
the graphical model structure exactly with high probability even with empirical entropies.

Lemma 3 Consider a graphical model $G$ in which node $i$ satisfies Assumptions 1 and 2.
Let the girth of $G_f$ be $g_f > 4\left(f^{-1}(h) + 1\right)$, where $h$ is as defined in Theorem 2. Then,
$\forall A \subset N(i)$, $u \notin N(i)$, $\exists j \in N(i) \setminus A$ such that

H(Xi I X A , Xj) < H(Xi I X A , X u ) - ~ (6) 

Proof If $A$ separates $i$ and $u$ in $G_f$ it also does so in $G$. Then we have that $P(x_i \mid x_A, x_u) = P(x_i \mid x_A)$
and hence $H(X_i \mid X_A, X_u) = H(X_i \mid X_A)$. Then, the statement of the lemma
follows from (3).

Now suppose $A$ does not separate $i$ and $u$ in $G_f$. Consider the shortest path between $i$
and $u$ in $G_f \setminus A$. Let $j \in N(i) \setminus A$ and $l \in N(j) \setminus \{i\}$ be on that shortest path. Assumption
1 implies that $H(X_i \mid X_A, X_l) - H(X_i \mid X_A, X_j, X_l) > \epsilon$. Now, choose $B \subset V$ such that
AU B U {j} separates i and / in Gf and df(i,B) > where gf is the girth of Gf. 

Note that such a B (possibly empty) exists since the girth of Gf is gf and if a node in the 
separator is a factor node (i.e., not in V) then we can replace it by all its neighbors (in V). 
We then see using Lemma ljthat d(i, B) > 3/ 4 4 . From Assumption [2J we know that 

\P{%i,%N(i)uN 2 (i)) ~ P(xi,XN(i)uNi(i) I x B )\ < f (x - 1) 

\P(^XA,x J )-P(x i ,x A ,x j \x B )\<\X\( D+1 ff^-l)Vx B 

H{Xi,X A , Xj) - H&uX^Xi I X B ) < -l^l^ 1 ) 2 / (f - 1) (log / (f - 1)) 4 e 
(H(Xi I X A , Xj) + H(X A , X,)) - (H(Xi I X A , Xj,X B ) + H(X A , Xj \ X B )) < e 
=> H(Xi I X A , Xj) - H{Xi I X A , Xj, X B ) < e, 

where the first implication follows from marginalizing irrelevant variables and the second
implication follows from (1). Using this we have that,

H(Xi j X A ,Xj,Xi) > H(Xi j X A , Xj, Xi, X B ) 

Xa,Xj,Xb 

= H(Xi I X A , Xj , X B ) since X t il_ X l 
> H(Xi\X A ,Xj)-e 

Using a similar argument, we also have, 

H(X % I X A ,Xi,X u ) > H(Xi I X A ,X t ) -e 

Combining the two inequalities, and using the fact that under the given conditions e< |, 
we get 

H{Xi I X A , Xj) < H{Xi I X A ,X U ) - j. 
Lemma 4 Consider a graphical model $G$ in which each node takes values in $\mathcal{X}$. Let the
number of samples be

$$n > 2^{15} \epsilon^{-4} |\mathcal{X}|^{4(D+2)} \left( (D+2) \log 2|\mathcal{X}| + 2 \log \frac{p}{\delta} \right).$$

Then $\forall i \in G$, with probability greater than $1 - \frac{\delta}{p}$, we have that $\forall A \subset N(i)$, $u \notin N(i)$



< 



Proof We use the fact that given sufficient samples, the empirical measure is close to the
true measure uniformly in probability. Specifically, given any subset $A \subset V$ of nodes and
any fixed $x_A \in \mathcal{X}^{|A|}$, we have by Azuma's inequality after $n$ samples,

$$P\left( \left| \hat{P}(x_A) - P(x_A) \right| > \gamma \right) < 2 \exp(-2 \gamma^2 n) < \frac{2\delta}{p^2 (2|\mathcal{X}|)^{D+2}},$$

where $\gamma = 2^{-8} \epsilon^2 |\mathcal{X}|^{-2(D+2)}$. Let $V$ be the set of all vertices. Now, by union bound over
every $A \subset N(i)$, $u \in V$ and $x_i$, $x_A$, $x_u$, we have



> 7 



< 



P 



(1) then implies



I Xa, X u ) — H(Xi I Xa,X u ] 



> 



< 



P 



giving us the required result. ■ 

Using Lemmas 3 and 4 we have the following: $\forall i \in G$ such that Assumptions 1 and 2 are
satisfied, with probability greater than $1 - \frac{\delta}{p}$, we have that $\forall A \subset N(i)$, $u \notin N(i)$, $\exists j \in
N(i) \setminus A$ such that

$$\hat{H}(X_i \mid X_A, X_j) < \hat{H}(X_i \mid X_A, X_u) \qquad (7)$$



and $\forall i \in G$ such that Assumptions 1 and 2 are satisfied, $\forall A \subset N(i)$, $j \in N(i) \setminus A$, we
have that

$$\hat{H}(X_i \mid X_A, X_j) < \hat{H}(X_i \mid X_A) - \frac{\epsilon}{2}. \qquad (8)$$

Proof [Theorem 2] The proof is based on mathematical induction. The induction claim
is as follows: just before entering an iteration of the WHILE loop, $\hat{N}(i) \subseteq N(i)$. Clearly
this is true at the start of the WHILE loop since $\hat{N}(i) = \emptyset$. Suppose it is true just
after entering the iteration. If $\hat{N}(i) = N(i)$ then clearly $\forall j \in V \setminus \hat{N}(i)$,
$H(X_i \mid X_{\hat{N}(i)}, X_j) = H(X_i \mid X_{\hat{N}(i)})$. Since with probability greater than $1 - \frac{\delta}{p}$ we have that



H(Xi I Xft liV Xj) - H(Xi I Xfi^,Xj) 



also have that 



< I and 



< f , we 



H(Xi I X R(i) ) - H(Xi I X R{i) ) 

< f. So control exits the loop with 



H(Xi I Xft^yXj) - H(Xi I Xfr^) 

out changing N(i). On the other hand, if 3j G N(i) \ N(i) then from ([8j) we have that 
H(Xi I Xfi,q) — H(Xi I Xfi^yXj) > |. So, a node is chosen to be added to N(i) and 
control does not exit the loop. Now suppose for contradiction that a node $u \notin N(i)$ is added
to $\hat{N}(i)$. Then we have that $\hat{H}(X_i \mid X_{\hat{N}(i)}, X_u) < \hat{H}(X_i \mid X_{\hat{N}(i)}, X_j)$. But this contradicts (7).

Thus, a neighbor $j \in N(i) \setminus \hat{N}(i)$ is picked in the iteration to be added to $\hat{N}(i)$, proving
that the neighborhood of $i$ is recovered exactly with probability greater than $1 - \frac{\delta}{p}$. Using
union bound, it is easy to see that the neighborhood of each node (i.e., the graph structure)
is recovered exactly with probability greater than $1 - \delta$. ■



Remark 5 The proof for Theorem 2 can also be used to provide node-wise guarantees, i.e.,
for every node satisfying Assumptions 1 and 2, if the number of samples is sufficiently large
(in terms of its degree, and the length of the smallest cycle it is part of), its neighborhood
will be recovered exactly with high probability.

Remark 6 Any decreasing correlation-decay function $f$ suffices for Theorem 2. However,
the faster the correlation decay, the smaller the girth in the sufficient condition for Theorem
2 needs to be.

And finally we have a corollary for the computational complexity of GreedyAlgorithm($\epsilon$)
when executed on a graphical model that satisfies the conditions required by Theorem 2.

Corollary 1 The expected run time of Algorithm 1 is $O(\delta n p^3 + (1 - \delta) D n p^2)$. Further, if
$\delta$ is chosen to be $O(p^{-1})$, the sample complexity $n$ is $O(\log p)$ and the expected run time of
Algorithm 1 is $O(D p^2 \log p)$.

Proof For the second part, note that with probability greater than $1 - \delta$, the algorithm
recovers the correct graph structure exactly. In this case, the number of iterations of the
while loop is bounded by $D$ for each node. The time taken to compute any conditional
entropy is bounded by $O(n)$. Hence the total run time is $O(D n p^2)$. Using the previous
worst case bound on the running time, we obtain the result. ■



5. Guarantees for the Recovery of an Ising Graphical Model 

To aid in the interpretation of our results and comparison to the performance of other
algorithms, we now specialize Theorem 2 to derive a self-contained result (i.e. we do not
need to make additional use of Assumptions 1 and 2) for the case of the widely-studied
(zero-field) Ising graphical model (Brush, 1967). We define it below for completeness:

Definition 2 A set of random variables $\{X_v \mid v \in V\}$ are said to be distributed according
to a zero-field Ising model if

1. $X_v \in \{-1, 1\}$ $\forall v \in V$ and

2. $P(x_V) = \frac{1}{Z} \prod_{i, j \in V} \exp(\theta_{ij} x_i x_j)$,

where $Z$ is a normalizing constant. The Markov graph of such a set of random variables is
given by $G(V, E)$ where $E = \{\{i, j\} \mid \theta_{ij} \ne 0\}$.

The following is a corollary of Theorem 2. We note that while it may be possible
to derive a stronger guarantee for Ising models (this is also suggested by experiments), we
focus on just applying Theorem 2 as is, and obtaining a set of transparent and self-contained
conditions in terms of natural parameters of the model.

Theorem 7 Consider a zero-field Ising model on a graph $G$ with maximum degree $D$.
Let the edge parameters 9ij be bounded in the absolute value by < j3 < \9ij\ < • Let 
e = 2 _10 sinh 2 (2/3). If the girth of the graph satisfies g > ^ {D 2 log2 - log (sinh2/3)} then 
with $n = \xi \epsilon^{-4} \log\frac{p}{\delta}$ samples (where $\xi$ is a constant independent of $\epsilon$, $\delta$, $p$), GreedyAlgorithm($\epsilon$)
outputs the exact structure of $G$ with probability greater than $1 - \delta$.



The proof of this theorem consists of showing that such an Ising model satisfies
Assumptions 1 and 2 and the other conditions of Theorem 2. In Section 5.1, we show that under
certain conditions, an Ising model has an almost exponential correlation decay. Then in
Section 5.2, we use the correlation decay of Ising models to show that under some further
conditions, they also satisfy Assumption 1 for non-degeneracy. Combining the two, we get
the above sufficient conditions for GreedyAlgorithm($\epsilon$) to learn the structure of an Ising
graphical model with high probability.

5.1 Correlation Decay in Ising Models 

We will start by proving the validity of Assumption 2 in the form of the following proposition.



Proposition 3 Consider a zero-field Ising model on a graph G with maximum degree D 
and girth g. Let the edge parameters 9ij be bounded in the absolute value by \9ij\ < ^§^. 
Then, for any node i, its neighbors A rl (i) ; its 2-hop neighbors N 2 (i) and a set of nodes A, 
we have 

\P(xi,x N i^,x N 2 (i ) I x A ) - P(xi,x N i^,x N 2 (i ))\ < cexp ^-^|^min (d(i,A), | - lj^j 

V Xi,x^if{\,x N 2u\ and xa (where c is a constant independent of i and A). 

The outline of the proof of Proposition 3 is as follows. First, we show that if a subset of
nodes is conditioned on a Markov blanket (i.e., on another subset of nodes which separates
them from the remaining graph), then their potentials remain the same. For this we have
the following lemma.

Lemma 8 Consider a graphical model $G(V, E)$ and the corresponding factorizable
probability distribution function $P$. Let $A$, $B$ and $C$ be a partition of $V$ and let $B$ separate $A$ and $C$ in
$G$. Let $\tilde{G}(A \cup B, \tilde{E})$ be the induced subgraph of $G$ on $A \cup B$, with the same edge potentials
as $G$ on all its edges, and let $\tilde{P}$ be the corresponding probability distribution function. Then, we
have that $P(x_D \mid x_B) = \tilde{P}(x_D \mid x_B)$ $\forall x_D, x_B$ where $D \subseteq A$.
Now, for any node $i$, the induced subgraph on all nodes which are at distance less than
$\frac{g}{2} - 1$ is a tree. Thus we can concentrate on proving correlation decay for a tree Ising model.
We do this through the following steps:

1. Without loss of generality, the tree Ising model can be assumed to have all positive
edge parameters.

2. The worst case configuration for the conditional probability of the root node is when
all the leaf nodes are set to the same value and all the edge parameters are set to the
maximum possible value.

3. For this scenario, correlation decays exponentially.

The following three lemmas encode these three steps. For proofs, refer to the Appendix.
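To build intuition for the third step, exponential decay can be observed by brute force on the simplest tree, a path. The sketch below is an illustration under assumptions that differ from the lemmas: it uses a single shared edge parameter $\theta$ and measures decay through the pairwise correlation $E[X_0 X_d]$ rather than through the joint distribution of a 2-hop neighborhood. For a zero-field Ising path this correlation equals $\tanh(\theta)^d$, a standard closed form that the enumeration is compared against.

```python
from itertools import product
from math import exp, tanh

def path_correlations(m, theta):
    """E[X_0 X_d] for d = 1..m on a zero-field Ising path 0-1-...-m, by enumeration."""
    weights = {
        x: exp(theta * sum(x[k] * x[k + 1] for k in range(m)))
        for x in product([-1, 1], repeat=m + 1)
    }
    Z = sum(weights.values())
    return [
        sum(w * x[0] * x[d] for x, w in weights.items()) / Z
        for d in range(1, m + 1)
    ]

theta = 0.3  # illustrative value, not taken from the paper
corr = path_correlations(6, theta)
closed_form = [tanh(theta) ** d for d in range(1, 7)]
```

The enumerated correlations shrink by a constant factor $\tanh(\theta) < 1$ per hop, which is the geometric decay that the lemmas establish in a stronger form for general trees.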

Lemma 9 Consider a tree Ising graphical model $T$. Let the corresponding probability
distribution be $P$. Replace all the edge parameters on this graphical model by their absolute
values. Let the corresponding probability distribution after this change be $\tilde{P}$. Then, there
exists a set of bijections
$\{M_v : \{-1, 1\} \to \{-1, 1\} \mid v \in V \setminus \{r\}\}$, where $V$ is the set of vertices and $r$ is the root node,
such that $\forall x_r, x_{V \setminus r}$ we have that $P(x_r, x_{V \setminus r}) = \tilde{P}(x_r, M_v(x_v), v \in V \setminus r)$.

Lemma 10 For a tree Ising graphical model $T$ with root $r$ and set of leaves $L$, we have

$$(x_r = 1, x_L = 1) \in \arg\max \left| P(x_r \mid x_L) - P(x_r) \right|.$$

And finally we have the following lemma. 

Lemma W In a tree Ising model, suppose \9ij\ < 7 < where D is the maximum 
degree of the graph. Then we have exponential correlation decay between the root node r, 
its neighbors A rl (r), its 2-hop neighbors N 2 (r) and the set of leaves L i.e., 

\P(x r ,X N i( r) ,X N 2 (r ) I X L ) - P(x r ,X N l( r ),X N 2 {r) )\ < C exp(- ~ |~d(r, L ) ) 

where c is a constant independent of the nodes considered. 



5.2 Non-degeneracy in Ising Models with Correlation Decay 

Now using the results from the previous section, we turn our attention to the question of
non-degeneracy. In particular, we have the following lemma which says that if an Ising
graphical model has almost exponential correlation decay and its edge parameters satisfy
certain conditions, then it also satisfies Assumption 1. For the proof, refer to the Appendix.

Lemma 12 Consider an Ising graphical model with edge parameters Oij bounded in the 
absolute value by < (3 < < 7, max degree D, and having correlation decay as follows 

\P(xi,x N i^,x N 2^) - P(xi,x N i^,x N 2^\x B )\ < cexp I -amin I d(i,B), 
V i, B, Xi ,x N x {i) , x N 2 (i) . If the girth g > 2 + \ {(2D + 11) log2 + logc + log (l + 2 D e 2 ^) + 

2y(D + 3) - log^l }, then this yrapkical m oM satisfies Assumption Q u,itk « = 
2 _7 e -6 7 D sin h 2 (2/3). 

Finally, the proof of Theorem 7 follows directly by combining Theorem 2, Proposition
3 and Lemma 12. For complete details, refer to the Appendix.

6. Simulations 

In this section, we present the results of numerical experiments evaluating the performance
of our algorithm. We note that for our theoretical guarantees to be applicable,
the graph should have a large girth. However, it seems that the strong
sufficiency conditions are a result of our analysis. In fact our algorithm seems to work well
even on graphs with small girth. To demonstrate this fact we perform our experiments on
graphs with small girth. The parameter $\epsilon$, which is an input to the algorithm, is chosen empirically.

In the first experiment, we evaluate our algorithm on grids of various sizes. Fig. 3
compares the sample complexity and computational complexity of our algorithm to those
of (Ravikumar et al., 2010), which will be henceforth referred to as RWL. Note that RWL
is specifically tailored to the Ising model, and leverages this to yield lower sample
complexity. Ours is a generic algorithm that can be used for any discrete graphical model, and
thus requires a larger (but comparable) number of samples. It can be seen however that our
algorithm is much faster than RWL.

Finally, we present an application of our algorithm to modeling the senator interaction
graph using the senate voting records, following (Banerjee et al. 2008). A Yea vote is
treated as a 1, whereas a Nay vote or absentee vote is treated as $-1$. To avoid bias, we
only consider senators who voted on at least a fraction 0.75 of all the bills during the
years 2009 and 2010. The output graph is presented in Fig. 4.
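
The preprocessing described above can be sketched as follows. This is a minimal
illustration and not the code used for the experiment: the input format (a dict mapping
each senator to a list of votes, with `None` marking an absence) and the names
`encode_votes` and `min_fraction` are assumptions made for the example.

```python
def encode_votes(records, min_fraction=0.75):
    """Map raw votes to +1/-1 and drop senators who voted too rarely.

    records: dict mapping senator name -> list of "Yea", "Nay" or None
             (None marks an absence); this input format is hypothetical.
    Returns a dict mapping each retained senator to a list of +1/-1 values.
    """
    encoded = {}
    for senator, votes in records.items():
        if not votes:
            continue
        cast = sum(v is not None for v in votes)
        if cast / len(votes) < min_fraction:
            continue  # voted on fewer than min_fraction of the bills
        # Yea -> +1; Nay or absent -> -1, as described in the text.
        encoded[senator] = [1 if v == "Yea" else -1 for v in votes]
    return encoded


records = {
    "A": ["Yea", "Nay", None, "Yea"],   # voted on 3 of 4 bills: kept
    "B": ["Yea", None, None, "Nay"],    # voted on 2 of 4 bills: dropped
}
print(encode_votes(records))  # {'A': [1, -1, -1, 1]}
```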



7. Discussion 



We developed a simple greedy algorithm for Markov structure learning. The algorithm is 
simple to implement and has low computational complexity. We then showed that under 
some non-degeneracy, correlation decay, maximum degree and girth assumptions on the 
MRF, our algorithm recovers the correct graph structure with 0(e _4 log|) samples. We 
then specialized our conditions to prove a self-contained result for the most popular
discrete graphical model, the Ising model.

The success of our algorithm can be further improved by post-processing via pruning. In
particular, as mentioned, the neighborhood of a node as estimated by our algorithm always
includes the true neighborhood, but it may also include spurious nodes. The latter can
then be identified by checking each node of the estimated neighborhood to see if it
actually provides a reduction in conditional entropy over and above all the other nodes.
Analysis of the improvement achieved by such a procedure is more challenging, but it is
plausible that doing so will reveal an algorithm that can handle much larger degrees and
smaller girths.
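
A minimal sketch of such a pruning pass is given below, under stated assumptions: samples
are tuples indexed by node, entropies are plug-in estimates from empirical counts, and
`threshold` is a hypothetical tuning parameter playing the role of the minimum entropy
reduction a true neighbor should provide. It is an illustration of the idea, not an
analyzed procedure.

```python
from collections import Counter
from math import log


def cond_entropy(samples, i, cond):
    """Empirical conditional entropy H(X_i | X_cond), in nats."""
    cond = tuple(cond)
    n = len(samples)
    joint = Counter((s[i], tuple(s[j] for j in cond)) for s in samples)
    marg = Counter(tuple(s[j] for j in cond) for s in samples)
    return -sum(c / n * log(c / marg[key[1]]) for key, c in joint.items())


def prune(samples, i, neighborhood, threshold):
    """Keep node j only if dropping it raises H(X_i | rest) by more than threshold."""
    full = cond_entropy(samples, i, neighborhood)
    kept = []
    for j in neighborhood:
        rest = [k for k in neighborhood if k != j]
        if cond_entropy(samples, i, rest) - full > threshold:
            kept.append(j)
    return kept


# Toy data: X_0 always equals X_1, while X_2 is unrelated to both.
samples = [(a, a, b) for a in (-1, 1) for b in (-1, 1)] * 5
print(prune(samples, 0, [1, 2], threshold=0.1))  # [1]
```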

Figure 3: Plots of (a) sample complexity and (b) computational complexity of our algorithm
(GA) and that of (Ravikumar et al. 2010) (RWL) for various grid sizes. Edge
parameters are all chosen to be equal to 0.5. The X-axis represents the number of
variables (9 for a 3 x 3 grid, 16 for a 4 x 4 grid and so on). In (a), the Y-axis
represents the sample complexity, which is taken to be the minimum number of
samples required to obtain a probability of success of 0.95. In (b), the Y-axis is
in logarithmic scale and represents the time taken in seconds for a single run
using the number of samples from (a). All the above quantities are calculated by
averaging over 50 runs.

Appendix

We will first prove the lemmas required for proving Proposition 3.

Proof [Lemma 9] The proof is by construction. For each node $v \in V$, let
$M_v(x_v) = \eta_v x_v$. For the root node, let $\eta_r = 1$. For any other node $v$, let
$u$ be the parent of $v$ in the rooted tree with root $r$. Define
$\eta_v = \frac{\theta_{uv}}{|\theta_{uv}|}\eta_u$. Let $\Phi$ and $\tilde{\Phi}$ be the
potential functions corresponding to $P$ and $\tilde{P}$ respectively. Then,

$$\begin{aligned}
\Phi(x_V) &= \prod_{uv \in T} \exp(\theta_{uv} x_u x_v) \\
&= \prod_{uv \in T} \exp\left(|\theta_{uv}| \frac{\eta_v}{\eta_u} x_u x_v\right) \\
&= \prod_{uv \in T} \exp(|\theta_{uv}|\, \eta_u \eta_v x_u x_v) \\
&= \prod_{uv \in T} \exp(|\theta_{uv}|\, M_u(x_u) M_v(x_v)) \\
&= \tilde{\Phi}(x_r, M_v(x_v), v \in V \setminus r)
\end{aligned}$$

Since the potential functions are preserved by the bijections, so are the probabilities. ■
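
The construction in this proof can be checked numerically on a small tree. The sketch
below is only an illustration: the tree, its edge parameters, and the helper names
`sign_flips` and `potential` are made up for the example. It computes the signs $\eta_v$
from the root down and verifies that the potential with parameters $\theta_{uv}$ at $x$
equals the potential with parameters $|\theta_{uv}|$ at the flipped configuration.

```python
from itertools import product
from math import exp, isclose


def sign_flips(parent, theta, root):
    """eta[root] = 1 and eta[v] = sign(theta[(u, v)]) * eta[u] for the parent u of v."""
    eta = {root: 1}
    pending = list(parent)
    while pending:
        v = pending.pop(0)
        u = parent[v]
        if u not in eta:
            pending.append(v)  # parent not processed yet; retry later
            continue
        eta[v] = eta[u] if theta[(u, v)] > 0 else -eta[u]
    return eta


def potential(theta, x):
    """Product over edges of exp(theta_uv * x_u * x_v)."""
    return exp(sum(t * x[u] * x[v] for (u, v), t in theta.items()))


# Hypothetical 4-node tree rooted at 0, with edges keyed as (parent, child).
parent = {1: 0, 2: 0, 3: 1}
theta = {(0, 1): -0.4, (0, 2): 0.3, (1, 3): -0.2}
eta = sign_flips(parent, theta, root=0)
abs_theta = {edge: abs(t) for edge, t in theta.items()}

for values in product([-1, 1], repeat=4):
    x = dict(enumerate(values))
    mapped = {v: eta[v] * x[v] for v in x}  # M_v(x_v) = eta_v * x_v
    assert isclose(potential(theta, x), potential(abs_theta, mapped))
```

Each map $M_v$ is a bijection on $\{-1, +1\}$, so the check above covers every configuration exactly once.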



We first prove the following lemma, which will help us in proving Lemma 10.



Lemma 13 Consider a tree Ising graphical model $T$ with root $r$, set of leaves $L$ and all
positive edge parameters. Let $P$ be its probability distribution. Then, the quantity
$P(X_r = 1 \mid X_L = x_L)$ is monotonically increasing in $x_l$, $\forall\, l \in L$. Moreover,
$P(X_r = 1 \mid X_L = 1)$ is monotonically increasing in $\theta_{ij}$ $\forall\, \{i,j\} \in T$.

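Proof For simplicity of notation, we define $f(x_L) \triangleq P(X_r = 1 \mid X_L = x_L)$. Let us
prove the above statement by induction on the depth of the tree. For a tree of depth 1, we
have that, for any leaf $\tilde{l} \in L$,

$$f(x_L) = \frac{\prod_{l \in L} \exp(\theta_{rl} x_l)}{\prod_{l \in L} \exp(\theta_{rl} x_l) + \prod_{l \in L} \exp(-\theta_{rl} x_l)}
= \frac{\prod_{l \in L \setminus \tilde{l}} \exp(\theta_{rl} x_l)}{\prod_{l \in L \setminus \tilde{l}} \exp(\theta_{rl} x_l) + \exp(-2\theta_{r\tilde{l}} x_{\tilde{l}}) \prod_{l \in L \setminus \tilde{l}} \exp(-\theta_{rl} x_l)}$$

Since $\theta_{r\tilde{l}} > 0$, $f(x_L)$ increases when $x_{\tilde{l}}$ is changed from $-1$ to $1$.

Now, suppose the statement is true for all trees of depth up to $k$. Consider a tree of
depth $k + 1$, with root $r$. Let $N(r)$ be the set of children of $r$. For every $c \in N(r)$,
let $T_c$ be the subtree rooted at $c$ with the same edge parameters as in $T$ and $L_c$ be the
leaves of $T_c$. Let $P_c$ be the probability measure corresponding to $T_c$ and
$f_c(x_{L_c}) \triangleq P_c(x_c = 1 \mid x_{L_c})$. Then, the conditional probability of the root
node can be written as

$$f(x_L) = \frac{\prod_{c \in N(r)} \left(\exp(\theta_{rc}) f_c(x_{L_c}) + \exp(-\theta_{rc})(1 - f_c(x_{L_c}))\right)}{B}$$

where

$$B = \prod_{c \in N(r)} \left(\exp(\theta_{rc}) f_c(x_{L_c}) + \exp(-\theta_{rc})(1 - f_c(x_{L_c}))\right) + \prod_{c \in N(r)} \left(\exp(-\theta_{rc}) f_c(x_{L_c}) + \exp(\theta_{rc})(1 - f_c(x_{L_c}))\right)$$

This expression can now be manipulated to obtain (10): for any child $\tilde{c} \in N(r)$,

$$f(x_L) = \frac{K_1}{K_1 + K_2\, \dfrac{g_{\tilde{c}}(x_{L_{\tilde{c}}}) + \exp(2\theta_{r\tilde{c}})}{g_{\tilde{c}}(x_{L_{\tilde{c}}}) \exp(2\theta_{r\tilde{c}}) + 1}} \qquad (10)$$

where $g_{\tilde{c}}(x_{L_{\tilde{c}}}) = \frac{f_{\tilde{c}}(x_{L_{\tilde{c}}})}{1 - f_{\tilde{c}}(x_{L_{\tilde{c}}})}$, and
$K_1$ and $K_2 > 0$ are independent of $x_{L_{\tilde{c}}}$ and $\theta_{r\tilde{c}}$. Since $K_2 > 0$
and $\theta_{r\tilde{c}} > 0$, $f(x_L)$ increases if $f_{\tilde{c}}(x_{L_{\tilde{c}}})$ increases. So, for
any leaf node, if its value changes from $-1$ to $1$, the corresponding
$f_{\tilde{c}}(x_{L_{\tilde{c}}})$ increases and hence $f(x_L)$ increases, proving the induction claim.

Using the same induction argument as above and noting that $f(x_L = 1) > \frac{1}{2}$, it can be
seen that $f(x_L = 1)$ is monotonically increasing in $\theta_{ij}$ $\forall\, \{i,j\} \in T$. ■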
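
The first claim of Lemma 13 can be sanity-checked by brute force on a small tree. The
sketch below is an illustration under assumed inputs: the 6-node tree, its positive edge
parameters, and the function name `root_posterior` are hypothetical. It enumerates all
configurations of the non-leaf nodes to compute $P(X_r = 1 \mid X_L = x_L)$ and checks that
flipping any single leaf from $-1$ to $+1$ increases it.

```python
from itertools import product
from math import exp


def root_posterior(edges, n, root, leaves, leaf_values):
    """P(X_root = 1 | X_leaves = leaf_values) by brute-force enumeration."""
    fixed = dict(zip(leaves, leaf_values))
    free = [v for v in range(n) if v not in fixed]
    weight = {1: 0.0, -1: 0.0}
    for values in product([-1, 1], repeat=len(free)):
        x = dict(fixed)
        x.update(zip(free, values))
        weight[x[root]] += exp(sum(t * x[u] * x[v] for (u, v), t in edges.items()))
    return weight[1] / (weight[1] + weight[-1])


# Hypothetical tree: root 0, children 1 and 2, leaves 3, 4 (under 1) and 5 (under 2).
edges = {(0, 1): 0.3, (0, 2): 0.5, (1, 3): 0.2, (1, 4): 0.4, (2, 5): 0.1}
leaves = [3, 4, 5]

for config in product([-1, 1], repeat=len(leaves)):
    base = root_posterior(edges, 6, 0, leaves, config)
    for k, value in enumerate(config):
        if value == -1:
            flipped = config[:k] + (1,) + config[k + 1:]
            assert root_posterior(edges, 6, 0, leaves, flipped) > base
```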



Proof [Lemma 10] We know that $P(x_r) = \frac{1}{2}$ for $x_r = \pm 1$. Clearly any $x_L$ that
maximizes $|P(x_r \mid x_L) - P(x_r)|$ should either minimize or maximize $P(x_r \mid x_L)$. Note
also that there is a one-one correspondence between such configurations (i.e., for every
maximizing configuration, there exists a minimizing configuration such that both of them
maximize $|P(x_r \mid x_L) - P(x_r)|$). From Lemma 13, we know that $x_L = 1$ maximizes
$P(x_r = 1 \mid x_L)$ and by symmetry this should be the same as $P(x_r = -1 \mid x_L = -1)$ and
equal to $\max_{x_L} P(x_r = -1 \mid x_L)$. So, we can conclude that $|P(x_r \mid x_L) - P(x_r)|$ is
maximized by $(x_r = 1, x_L = 1)$. ■



Lemma 14 Consider a tree Ising model T with root node r, set of leaves L and maximum 
degree D. Let P be its probability measure. Suppose the absolute values of the edge param- 

< V {i, j} G T. Then, we have that \P{x r \ xl) — P(x r )\ < 



eters are bounded by \9ij\ 
exp(— ^§^-d(r, L)) Mx t ,xl 



Proof Using Lemmas 9, 10 and 13 we can assume without loss of generality that the
parameters $\theta_{ij}$ on all the edges are positive and equal to $\frac{\log 2}{2D}$ (which
is the maximum possible value), consider a complete $D$-ary tree and concentrate on
$|P(X_r = 1 \mid X_L = 1) - P(X_r = 1)|$. For simplicity of notation, let
$\theta \triangleq \frac{\log 2}{2D}$. For a tree of depth $d$, let
$a(d) \triangleq P(X_r = 1 \mid X_L = 1)$. We have that

$$a(d+1) = \frac{\left(\exp(\theta) a(d) + \exp(-\theta)(1 - a(d))\right)^D}{\left(\exp(\theta) a(d) + \exp(-\theta)(1 - a(d))\right)^D + \left(\exp(-\theta) a(d) + \exp(\theta)(1 - a(d))\right)^D}$$

Using some algebraic manipulations and substituting the value of $\theta$, we obtain



a(d + 1) 



1 



< exp 



log 2 



a(d) 



and the result follows. 
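
The recursion for $a(d)$ can also be iterated numerically to watch the gap $a(d) - \frac{1}{2}$
shrink with depth. The sketch below is only an illustration: it assumes the base case
$a(0) = 1$ (a depth-0 tree whose root is itself a leaf fixed to $+1$), uses
$\theta = \frac{\log 2}{2D}$ as in the proof, and the function names are hypothetical.

```python
from math import exp, log


def next_a(a, theta, D):
    """One step of the recursion a(d) -> a(d+1) on a complete D-ary tree."""
    up = (exp(theta) * a + exp(-theta) * (1 - a)) ** D
    down = (exp(-theta) * a + exp(theta) * (1 - a)) ** D
    return up / (up + down)


def root_bias(depth, D):
    """Return [a(1) - 1/2, ..., a(depth) - 1/2] with theta = log(2) / (2 D)."""
    theta = log(2) / (2 * D)
    a = 1.0  # assumed base case: at depth 0 the root is a leaf fixed to +1
    gaps = []
    for _ in range(depth):
        a = next_a(a, theta, D)
        gaps.append(a - 0.5)
    return gaps


for d, gap in enumerate(root_bias(depth=6, D=3), start=1):
    print(d, gap)
```

In this run the gap shrinks by more than a factor of two at every level, which is the
geometric decay that the lemma formalizes.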



We need the following lemma to prove Lemma 11.



Lemma 15 Consider a tree Ising model T , with root node r, set of leaves L and maximum 
degree D. Let P be its probability measure. Suppose the absolute values of the edge param- 
eters are bounded by < V {i,j} G T. Then, Vc such that c is a child of r, we have 
that \P(x c | Xt,xl) — P(x c | x r )\ < 4exp(— ^|^d(r, L)) Vx r , Xj,XL- 

Proof Using Lemma 9 we can assume without loss of generality that the parameters
$\theta_{ij}$ on all the edges are positive. $(x_c, x_r)$ can take values $(\pm 1, \pm 1)$. For
each of those values, the value of $x_L$ that maximizes
$|P(x_c \mid x_r, x_L) - P(x_c \mid x_r)|$ either maximizes or minimizes $P(x_c \mid x_r, x_L)$.
Noting from (a slight extension to) Lemma 13 that $P(x_c \mid x_r, x_L)$ is monotonic in $x_L$,
it suffices to consider the eight possibilities
$|P(X_c = \pm 1 \mid X_r = \pm 1, X_L = \pm 1) - P(X_c = \pm 1 \mid X_r = \pm 1)|$. We show how to
calculate the above value for $x_c = 1, x_r = 1, x_L = 1$. Interested readers can check that
the conclusions below apply to all the other cases as well. Using Lemma 13, we can assume
that the parameters $\theta_{ij}$ on all the edges except the edge $\{r, c\}$ are equal to
$\frac{\log 2}{2D}$ and consider a complete $D$-ary tree. Let $\theta = \theta_{rc}$. We know that

$$P(X_c = 1 \mid X_r = 1) = \frac{\exp(\theta)}{\exp(\theta) + \exp(-\theta)}.$$

Let $d$ be the depth of the tree and $b(d) = P(X_c = 1 \mid X_r = 1, X_L = 1)$. We have

$$b(d) = \frac{\exp(\theta)\, a(d-1)}{\exp(\theta)\, a(d-1) + \exp(-\theta)(1 - a(d-1))}$$

where $a(d)$ is as defined in Lemma 14. Using some algebraic manipulations, it can be shown
that

$$\left| b(d) - \frac{\exp(\theta)}{\exp(\theta) + \exp(-\theta)} \right| < 2 \left| a(d-1) - \frac{1}{2} \right|.$$

Using Lemma 14 finishes the proof. ■



Proof [Lemma 11] Using Lemma 15, we have

\P(x r , aJjVi(r)j x N 2 (r) I x l) ~ P ( x r , % N 1 (r) ■> x N 2 (r))\ 

P(x r \x L ) ] J P(Xj\x r ,X L ) Yl P ( x k\xj,X L ) 

jeN~>-(r) k£N 2 (r) 

P( x r) [J I Xr ) II P( - Xk I X - 

j&N 1 ^) k&N 2 {r) 



< 2 D2+3 exp I 



cexp 



(d(r,L)-lj) 



'-d(r,L) 



proving the result. 



Proof [Proposition 3] Let $I = \{i\} \cup N^1(i) \cup N^2(i)$. Let $B$ be a set that separates
$I$ and $A$ such that $d(I, B) = \min(d(i, A), \frac{g}{2} - 1)$. Let $J$ be the component of
nodes containing $I$ when the graph is separated by $B$. We know that the induced subgraph
on $J \cup B$ is a tree.



Applying Lemma 11 on this tree and using Lemmapl we obtain \P(xj \ xb) — P(xj \ xb)\ < 



2cexp(— ^^-d(I, B)) \/xj,xb,xb- Since P(xi) is a weighted average of P(xj \ xb) for 
various x^, we have 

\P(xi | x B )-P(x I )\ < 2cexp(-^d{I,B)) V Xl ,x B 
The result then follows since P(xj \ xa) is a weighted average of P(xj \ xb)- ■ 



Proof [Lemma 12] Let the graphical model be denoted by $G(V, E)$, let
$\Phi(x_i, x_j) = \exp(\theta_{ij} x_i x_j)$ denote the potential on edge $\{i, j\}$ when
$X_i = x_i$ and $X_j = x_j$, and let $\Phi(x_A)$ denote the potential due to all edges with both
vertices in $A$ when $X_A = x_A$, $\forall\, A \subseteq V$. In the following, we assume that the
girth of the graph is $g > 4$. Consider a node $i$ and a subset of its neighbors
$j_1, \dots, j_k, z$ and a node $w$ which is a neighbor of $z$. We know that the pairwise
potentials satisfy $\exp(-\gamma) < \Phi(x_i, x_j) < \exp(\gamma)$. Let
$\tilde{E} = E \setminus \{\{i, j_1\}, \dots, \{i, j_k\}, \{i, z\}, \{z, w\}\}$ and consider the graph
$\tilde{G}(V, \tilde{E})$ with the same potentials on all edges as in $G$. Let
$A = \{i, j_1, \dots, j_k, z, w\}$ and choose any other set $B \subset V$. Let $P$ and $\tilde{P}$ be
the probability mass functions corresponding to $G$ and $\tilde{G}$ respectively. Similarly let
$d(i, j)$ and $\tilde{d}(i, j)$ be the distance between $i$ and $j$ in $G$ and $\tilde{G}$
respectively. Suppose further that d(i, B) = d.
Then, d(i,B) > d(A, B) = d. Note that, 

¥>, \ 1 P(x A ,X B ) Ml s 

P(xA,X B ) = J ^ (x - r (11) 

where Z is an appropriate normalizing constant. Note that -g > — ; r- = > P(xa) = 1- 

It follows from this that exp(— 7) < i < exp(7). Using ( |llj ), the hypothesis that an Ising 
model has almost exponential correlation decay, we obtain the following inequalities after 
some algebraic manipulations, 

\P(xa,x b )-P{xa)P(xb)\ < c2 D+3 exp(4 7 ) exp(-aniiii(d, ^-))P(x B ) (12) 
P{x B ) > exp(-2 7 ) (\ - 2 D+2 cexp(-amin((f, P{x B ) (13) 



Vxa, x B - Combining (12) and (13), we obtain 



\P(xa,x b ) - P(xa)P{xb)\ < c2^ 3 exp(67) i -gg g^L^j) 
and subsequently by marginalizing, we obtain 

\P( Xi ,x B ) - P{x t )P(x B )\ < c2^ exp(6 7 ) - f^"" ^ ^zgrr P(*b) 

1 — z^cexp^-aminfa, 2~)) 
Let $A' = A \setminus \{i\}$. Since $d(i, A') = 2$, we have that $\tilde{d}(i, A') \geq g - 2$. So, $\exists\, B \subset V$ separating
i and A' in G such that 5) > Then, V Xi, xa' 



\P{Xi | XA') - P(Xi 



< c: 



2^ exp(67) _f^ 



exp(— a 2 ^) 



exp(— 2^(D + 1)) |sinh(2/3)| = e 



(14) 



where the last inequality follows from the lower bound on girth g in the hypothesis. 

Now consider the graph $\hat{G}(V, \hat{E})$ where
$\hat{E} = \{\{i, j_1\}, \dots, \{i, j_k\}, \{i, z\}, \{z, w\}\}$. Let the potentials on the edges in
$\hat{G}$ be the same as those in $G$ and denote the corresponding probability mass function
by $\hat{P}$. Clearly, we have the following relation between $P$, $\tilde{P}$ and $\hat{P}$.

$$P(x_A) = \frac{1}{Z} \tilde{P}(x_A) \hat{P}(x_A) \quad \forall\, x_A$$

where $Z$ is an appropriate normalizing constant. Using (14) and the symmetry of the Ising
model (i.e., $\tilde{P}(x_i) = \frac{1}{2}$ for $x_i = \pm 1$), we obtain



P(xi | X A' 



< 



P(Xj,X A ,) 

P{x A ,\ 

^P(xi,x A ,)P(xi,x A ,) 

i/^2P{xi,x A ')P{xi,x A ') 

P(Xi,X A ,)P(Xi,X A ,) 

S ^P{x i | X A ')P(xA')P(Xi,X A ' 
P(xi\x A ,)P{xj\x A i) 



< \ ± §P(X 1 | X A ' 



after some algebraic manipulations. Similarly, we also have 



P(xi | x A >) > \_^2e P(yXi ' XA ' 



which implies 



P(Xi | xa') — P(xi | x A 1 ) 



< 8e 
Finally, letting $A^* = A' \setminus \{z\}$, we have,
H(Xi | X A .)-H(Xi | X A .) 

X j^t X^ 

= Y J P{*A>)D{P{X i | XA,)\\P{Xi \x A ,)) 

x A , _ 
^ 2\k2^2 P ( X A')^2\ P ( x i I X A /)-P(xi | X A «)\ 2 

X j^i X'l 

= IJ2 P ( X A*)J2 P ( X * I x A*)J2\ P ( x i I SA')--P(a;i I *A*)\ 2 

X A* X z Xi 

> I y~] P(x A *)minP(x z | x A *)-\P(xi \ x A *,x z = -1) - P(xi \ x A *,x z = 1)| 2 

> 1 V P(r,,.) eM-lD) 

(max (o, \P(xi | = -1) - P{%i \ x A *,x z = 1) - 16e)) 

> ie pi'M-w ( ^m^tm _ 1K ) 2 

> I ig'exp(-67 J D)sinh 2 (2/3) 

So, we have shown that under the given conditions, an Ising model satisfies Q with e = 
jTjg exp(— 67!?) sinh 2 (2/3). It is straightforward to note that the above proof can also be 
used to show that the Ising model also satisfies Q with the same e, completing the proof 
of the lemma. 



Proof [Theorem 7] The theorem follows directly from Theorem 2, Proposition 3 and Lemma 12. ■